Suppose that instead of regression, our goal is classification. In this case the responses $y_i$ in our training data $\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$ are no longer numeric but qualitative.
To initially quantify the accuracy of our classifier $\hat{f}$, we regard the training error rate: $$ \frac{1}{n} \sum_{i=1}^{n} I\left(y_{i} \neq \hat{y}_{i}\right), $$ where $\hat{y}_{i}$ is the predicted class for the $i$th observation from $\hat{f}$, and $I\left(y_{i} \neq \hat{y}_{i}\right)$ is the indicator function, taking the value one if $y_{i} \neq \hat{y}_{i}$ and zero if $y_{i}=\hat{y}_{i}$.
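As a quick numeric illustration (with made-up labels), the training error rate is simply the fraction of misclassified training observations:

```r
# Hypothetical toy labels: y are the true classes, yhat the predictions from our classifier
y    <- c(1, 0, 1, 1, 0)
yhat <- c(1, 1, 1, 0, 0)

# (1/n) * sum(I(y_i != yhat_i)); mean() of the logical vector computes exactly this
train_err <- mean(y != yhat)
train_err # 2 of 5 observations are misclassified, so 0.4
```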
Analogously to the regression setting: we are most interested in the error rates that result from applying our classifier to test observations that were not used in training.
The test error rate associated with a set of test observations of the form $\left(x_{0}, y_{0}\right)$ is $$ \operatorname{Ave}\left(I\left(y_{0} \neq \hat{y}_{0}\right)\right). $$
Possible to show: test error rate given above is minimized, on average, by a very simple classifier that assigns each observation to the most likely class given its predictor values.
Assign a test observation with predictor vector $x_0$ to the class $j$ for which \begin{equation}\label{eq:bayes} \operatorname{Pr}\left(Y=j|X= x_{0}\right) \end{equation} is largest. This is a conditional probability: the probability that $Y = j$, given the observed predictor vector $x_0$. This simple rule is called the Bayes classifier.
Figure: a simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid the region in which it will be assigned to the blue class.
The Bayes classifier produces the lowest possible test error rate. Since the Bayes classifier always chooses the class for which equation (\ref{eq:bayes}) is largest, the error rate at $X = x_0$ is $1-\max_j \operatorname{Pr}(Y = j|X = x_0)$. In general, the overall Bayes error rate is given by
$$1-E\left(\max _{j} \operatorname{Pr}(Y=j | X)\right)$$where the expectation averages the probability over all possible values of $X$. For our simulated data, the Bayes error rate is 0.1304. It is greater than zero, because the classes overlap in the true population so $\max _{j} \operatorname{Pr}(Y=j|X=x_{0} )<1$ for some values of $x_0$.
In theory: we would always like to predict qualitative responses using the Bayes classifier. But: for real data, we do not know the conditional distribution of $Y$ given $X$ $\rightarrow$ computing the Bayes classifier is impossible.
Many approaches attempt to estimate the conditional distribution of $Y$ given $X$, and then classify a given observation to the class with highest estimated probability.
One such method: the K-nearest neighbors (KNN) classifier. Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_{0}$.
It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_{0}$ whose response values equal $j$: $$ \operatorname{Pr}\left(Y=j | X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right) $$
Finally, KNN applies Bayes rule and classifies the test observation $x_0$ to the class with the largest probability.
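A minimal sketch of this procedure in base R (for real data, `knn()` from the `class` package implements it; the function name here is our own):

```r
# KNN classifier sketch: estimate Pr(Y = j | X = x0) by the class fractions
# among the K nearest training points, then pick the most probable class.
knn_classify <- function(X_train, y_train, x0, K) {
  # Euclidean distances from x0 to every training point
  d <- sqrt(rowSums(sweep(X_train, 2, x0)^2))
  N0 <- order(d)[1:K]              # the neighbourhood N_0
  prob <- table(y_train[N0]) / K   # estimated conditional probabilities
  names(prob)[which.max(prob)]     # class with the largest probability
}

X_train <- matrix(c(0, 0,  0, 1,  5, 5,  5, 6), ncol = 2, byrow = TRUE)
y_train <- c("blue", "blue", "orange", "orange")
knn_classify(X_train, y_train, c(0.2, 0.2), K = 3) # "blue"
```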
In many classification problems the response variable is qualitative rather than quantitative. Predicting a qualitative response for an observation is called classifying that observation. Often, we instead predict the probability of each of the categories of a qualitative variable, and use these probabilities as the basis for classification.
Just as in the regression setting: we have a set of training observations $(x_1, y_1), . . . , (x_n, y_n)$ that we can use to build a classifier.
Classifier should perform well not only on the training data, but on test observations that were not used to train the classifier.
To illustrate: the simulated Default data set:
How can we build a model to predict default ($Y$) for any given values of balance ($X_1$) and income ($X_2$)?
We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?
Suppose: predict the medical condition of a patient on the basis of her symptoms.
There are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could encode these values as a quantitative response variable, $Y$:
\begin{equation} Y=\left\{\begin{array}{ll}{1} & {\text { if stroke }} \\ {2} & {\text { if drug overdose }} \\ {3} & {\text { if epileptic seizure. }}\end{array}\right. \end{equation}Now, use least squares to fit a linear regression model to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$.
But: this coding implies an ordering of the outcomes and equal gaps between adjacent categories, neither of which is justified for these diagnoses.
What about binary variables? Suppose there are only two possibilities for the patient's medical condition: stroke and drug overdose.
\begin{equation} Y=\left\{\begin{array}{ll}{0} & {\text { if stroke }} \\ {1} & {\text { if drug overdose }}\end{array}\right. \end{equation}For a binary response such a dummy coding is less problematic, but the fitted values of a linear regression need not lie in $[0,1]$. Logistic regression instead models the probability that $Y$ belongs to a particular category.
For the Default data example logistic regression models the probability of default:
\begin{equation} \operatorname{Pr}(\text { default }=\text { Yes } | \text { balance })=p(\text{balance}) \in[0,1] \end{equation}To make predictions, we might predict $default = Yes$ for any individual for whom $p(\text{balance}) > 0.5$.
In linear regression, we would model $p(X)$ directly:
\begin{equation} p(X)=\beta_{0}+\beta_{1} X \end{equation}In logistic regression: use the logistic function: \begin{equation} p(X)=\frac{e^{\beta_{0}+\beta_{1} X}}{1+e^{\beta_{0}+\beta_{1} X}} \end{equation} Rearranging yields the odds, which take values between $0$ and $\infty$:
\begin{equation} \frac{p(X)}{1-p(X)}=e^{\beta_{0}+\beta_{1} X} \end{equation}How to interpret the odds: values close to $0$ indicate a very low probability of default, and very large values a very high one.
For example, on average 1 in 5 people with an odds of $1/4$ will default, since $p(X) = 0.2$ implies an odds of $\frac{0.2}{1-0.2} = 1/4$. Likewise, on average nine out of every ten people with an odds of 9 will default, since $p(X) = 0.9$ implies an odds of $\frac{0.9}{1-0.9} = 9$.
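The two calculations above can be checked directly:

```r
# Odds as a function of the probability p(X)
odds <- function(p) p / (1 - p)
odds(0.2) # 0.25, i.e. odds of 1/4
odds(0.9) # 9
```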
Taking the logarithm on both sides yields the log-odds or logit.
\begin{equation} \log \left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1} X \end{equation}However, because the relationship between $p(X)$ and $X$ is not a straight line, $\beta_1$ does not correspond to the change in $p(X)$ associated with a one-unit increase in $X$.
The amount that $p(X)$ changes due to a one-unit change in $X$ will depend on the current value of $X$.
But regardless of the value of $X$, if $\beta_1$ is positive then increasing $X$ will be associated with increasing $p(X)$, and if $\beta_1$ is negative then increasing $X$ will be associated with decreasing $p(X)$.
## The logit transformation
## Generate a vector of probabilities (avoid 0 and 1, where the logit is infinite)
pi <- seq(0.001, 0.999, length.out = 100)
logit <- function(x)
{
  log(x / (1 - x))
}
plot(logit(pi), pi, type = "l", xlim = c(-5, 5), main = "The Logit Transformation",
     xlab = "logit", ylab = "probability", cex.main = 1.2, cex.lab = 1, cex.axis = 1)
## Maximum likelihood estimation: Logit
## General syntax
library(miscTools) # maxLik acts as a wrapper for the more basic "optim"; miscTools is required
library(maxLik)
loglike <- function(beta) # define a function that takes only the parameter vector as argument
{
  ll <- "my log likelihood function" # depending on your optimization routine,
  # check whether you need the negative or the positive log likelihood!
  return(ll)
}
# estim <- maxBFGS(loglike, finalHessian = TRUE, start = c(., .)) # initialize the optimization,
# pass on starting values and store the results in estim
# estim.par <- estim$estimate # store the parameter estimates in a variable "estim.par"
## An empirical example for the logit model
set.seed(40)
library(maxLik)
n <- 1000 # sample size
beta0 <- -2  # coefficient on the intercept
beta1 <- 0.1 # coefficient on the regressor
beta2 <- 1
beta <- cbind(beta0, beta1, beta2)
x <- sort(runif(n = n, min = 18, max = 60)) # generating the regressor
x2 <- rbinom(n, 1, 0.5)
pi_x <- exp(beta0 + beta1 * x + beta2 * x2) / (1 + exp(beta0 + beta1 * x + beta2 * x2)) # generating the probabilities for the binomial
y <- rbinom(n = length(x), size = 1, prob = pi_x) # drawing y with the transformed logit probabilities
data <- cbind(x, pi_x, y)
X <- cbind(rep(1, n), x, x2)
loglike <- function(beta) # the log likelihood function for the logit model
{
  ll <- sum(-y * log(1 + exp(-(X %*% beta))) - (1 - y) * log(1 + exp(X %*% beta)))
  return(ll)
}
estim <- maxBFGS(loglike, finalHessian = TRUE, start = c(0, 1, 1)) # initialize the estimation procedure
estim.par <- estim$estimate # the parameter estimates
estim.par
## Calculating standard errors of the coefficients
estim.hess <- estim$hessian # the optimization routine returns the Hessian matrix at the last iteration
Cov <- -solve(estim.hess) # the covariance matrix is the negative inverse of the Hessian
sde <- sqrt(diag(Cov)) # the standard errors are the square roots of the diagonal of the covariance matrix
sde
## How do we interpret the parameter estimates?
## Marginal effect of beta1 at x[500], for x2 = 0
x2 <- 0
pi <- exp(estim.par[1] + estim.par[2] * x[500] + estim.par[3] * x2) / (1 + exp(estim.par[1] + estim.par[2] * x[500] + estim.par[3] * x2))
dx <- estim.par[2] * pi * (1 - pi)
dx
## Marginal effect of beta1 at x[500], for x2 = 1
x2 <- 1
pi <- exp(estim.par[1] + estim.par[2] * x[500] + estim.par[3] * x2) / (1 + exp(estim.par[1] + estim.par[2] * x[500] + estim.par[3] * x2))
dx <- estim.par[2] * pi * (1 - pi)
dx
## Calculate for a range of x values and x2 = 0
x2 <- 0
dx <- c()
for(i in 1:length(x))
{
pi<-exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2) / (1 + exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2))
dx[i]<-estim.par[2]*(pi)*(1-pi)
}
plot(x,dx,type="l")
####Calculate for a range of x values and x2=1###
x2<-1
dx1<-c()
for(i in 1:length(x))
{
pi<-exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2) / (1 + exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2))
dx1[i]<-estim.par[2]*(pi)*(1-pi)
}
lines(x,dx1,col="red")
#####Predicted probabilities for beta2######
x2<-0
pi<-c()
for(i in 1:length(x))
{
pi[i]<-exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2) / (1 + exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2))
#dx[i]<-estim.par[2]*(pi)*(1-pi)
}
plot(x,pi,type="l")
#####Predicted probabilities for beta2######
x2<-1
pi1<-c()
for(i in 1:length(x))
{
pi1[i]<-exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2) / (1 + exp(estim.par[1] + estim.par[2] * x[i]+estim.par[3]*x2))
#dx[i]<-estim.par[2]*(pi)*(1-pi)
}
lines(x,pi1,col="red")
We can easily extend this framework to accommodate $p$ regressors as
\begin{equation}
p(X)=\frac{e^{\beta_{0}+\beta_{1} X_{1}+\cdots+\beta_{p} X_{p}}}{1+e^{\beta_{0}+\beta_{1} X_{1}+\cdots+\beta_{p} X_{p}}}
\end{equation}
To estimate $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$ we use maximum likelihood.
But: how to interpret the coefficients?
Better: predictions
For example: a student with a credit card balance of \$1,500 and an income of \$40,000 (income enters in thousands, so income $=40$) has an estimated probability of default of
\begin{equation} \hat{p}(X)=\frac{e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 1}}{1+e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 1}}=0.058 \end{equation}A non-student with the same balance and income has an estimated probability of default of
\begin{equation} \hat{p}(X)=\frac{e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 0}}{1+e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 0}}=0.105 \end{equation}
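These two predictions can be reproduced directly from the coefficient values quoted above (income in thousands, student as a 0/1 dummy; the helper name is our own):

```r
# Plug-in predicted default probability from the fitted coefficients in the text
phat <- function(balance, income, student) {
  eta <- -10.869 + 0.00574 * balance + 0.003 * income - 0.6468 * student
  exp(eta) / (1 + exp(eta))
}
phat(1500, 40, student = 1) # ~0.058 (student)
phat(1500, 40, student = 0) # ~0.105 (non-student)
```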
Previously: we modeled the conditional distribution of $Y$ given $X$ directly.
In LDA: model the distribution of the predictors $X$ separately in each of the response classes (i.e. given $Y$ ), and use Bayes’ theorem to flip these around into estimates for $\operatorname{Pr}(Y=k | X=x)$.
Suppose that we have $K$ classes, with $K \geq 2$.
Then Bayes' theorem states that
\begin{equation}\label{eq:bayes_theo} p_{k}(X)=\operatorname{Pr}(Y=k | X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)} \end{equation}
and we call $p_{k}(X)$ the posterior probability.
In order to estimate this, we need estimates for the priors $\pi_{k}$ and the densities $f_{k}(x)$.
Suppose first that $p=1$. Assuming a normal density, we can write
\begin{equation} f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left(-\frac{1}{2 \sigma_{k}^{2}}\left(x-\mu_{k}\right)^{2}\right) \end{equation}where $\mu_{k}$ and $\sigma_{k}^{2}$ are the mean and variance parameters for the $k$th class.
Further supposing that $\sigma_{1}^{2}=\ldots=\sigma_{K}^{2}=\sigma^{2}$ (a shared variance), we can write equation \ref{eq:bayes_theo} as
\begin{equation}\label{eq:bayes_norm} p_{k}(x)=\frac{\pi_{k} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{k}\right)^{2}\right)}{\sum_{l=1}^{K} \pi_{l} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{l}\right)^{2}\right)} \end{equation}The Bayes classifier assigns an observation $X = x$ to the class for which equation \eqref{eq:bayes_norm} is largest. Taking the log of \eqref{eq:bayes_norm} and rearranging terms, we can show that this is equivalent to assigning the observation to the class for which
\begin{equation}\label{eq:lda_delta_x} \delta_{k}(x)=x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right) \end{equation}is largest.
For instance, if $K = 2$ and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x\left(\mu_{1}-\mu_{2}\right)>\mu_{1}^{2}-\mu_{2}^{2}$, and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where
\begin{equation*} x=\frac{\mu_{1}^{2}-\mu_{2}^{2}}{2\left(\mu_{1}-\mu_{2}\right)}=\frac{\mu_{1}+\mu_{2}}{2} \end{equation*}Even if we were certain that $f_k$ is normal, we would still need to estimate $\mu_{1}, \ldots, \mu_{K}$, $\pi_{1}, \ldots, \pi_{K}$, and $\sigma^{2}$. LDA uses the plug-in estimates
\begin{equation} \begin{aligned} \hat{\mu}_{k} &=\frac{1}{n_{k}} \sum_{i : y_{i}=k} x_{i} \\ \hat{\sigma}^{2} &=\frac{1}{n-K} \sum_{k=1}^{K} \sum_{i : y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} \end{aligned} \end{equation}where $n$ is the total number of training observations, and $n_k$ is the number of training observations in the $k$th class.
Without additional information about class membership, we estimate \begin{equation} \hat{\pi}_{k}=n_{k} / n \end{equation}
Finally, the LDA classifier assigns an observation $X=x$ to the class for which
\begin{equation} \hat{\delta}_{k}(x)=x \cdot \frac{\hat{\mu}_{k}}{\hat{\sigma}^{2}}-\frac{\hat{\mu}_{k}^{2}}{2 \hat{\sigma}^{2}}+\log \left(\hat{\pi}_{k}\right) \end{equation}is largest, where the discriminant function $\hat{\delta}_{k}$ is a linear function of $x$ (hence the name \emph{linear} discriminant analysis).
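A minimal R sketch of these plug-in estimates for $p=1$ (function and variable names are our own):

```r
# One-dimensional LDA: estimate mu_k, the pooled sigma^2 and pi_k from the
# training data, then classify x0 by the largest discriminant score delta_k(x0).
lda_1d <- function(x_train, y_train, x0) {
  classes <- sort(unique(y_train))
  n <- length(x_train); K <- length(classes)
  mu  <- tapply(x_train, y_train, mean)                # class means mu_k-hat
  s2  <- sum(tapply(x_train, y_train,
                    function(v) sum((v - mean(v))^2))) / (n - K) # pooled variance
  pik <- as.numeric(table(y_train)) / n                # priors pi_k-hat
  delta <- x0 * mu / s2 - mu^2 / (2 * s2) + log(pik)   # linear discriminant scores
  classes[which.max(delta)]
}

x_train <- c(-2, -1.5, -1, 1, 1.5, 2)
y_train <- c("a", "a", "a", "b", "b", "b")
lda_1d(x_train, y_train, 1.8) # "b"
```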
We now assume that $X=\left(X_{1}, X_{2}, \ldots, X_{p}\right)$ follows a multivariate normal distribution, $X \sim N(\mu_{k}, \Sigma)$, with a \emph{class-specific} mean vector and a common covariance matrix.
For $p > 1$ predictors: the LDA classifier assumes that the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N\left(\mu_{k}, \mathbf{\Sigma}\right)$, where $\mu_k$ is a class-specific mean vector and $\mathbf{\Sigma}$ is a covariance matrix common to all classes.
Plugging the density function $f_{k}(x)$ for the $k$th class into \ref{eq:bayes_theo}, one can show that the resulting Bayes classifier assigns an observation $X = x$ to the class for which
\begin{equation}\label{eq:lda_mult} \delta_{k}(x)=x^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}+\log \pi_{k} \end{equation}is largest.
The decision boundaries represent the set of values $x$ for which $\delta_k(x)=\delta_l(x)$ for $k \neq l$:
\begin{equation} x^{T} \Sigma^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \Sigma^{-1} \mu_{k}=x^{T} \Sigma^{-1} \mu_{l}-\frac{1}{2} \mu_{l}^{T} \Sigma^{-1} \mu_{l} \end{equation}
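In matrix form, the discriminant score $\delta_k(x)$ of equation \eqref{eq:lda_mult} is a one-liner in R; in practice $\mu_k$, $\Sigma$ and $\pi_k$ would be estimated from the training data, but here they are assumed given:

```r
# delta_k(x) = x' Sigma^{-1} mu_k - (1/2) mu_k' Sigma^{-1} mu_k + log(pi_k)
delta_k <- function(x, mu_k, Sigma, pi_k) {
  Sinv <- solve(Sigma)
  drop(t(x) %*% Sinv %*% mu_k - 0.5 * t(mu_k) %*% Sinv %*% mu_k) + log(pi_k)
}

# With Sigma = I, mu_k = (1,1)' and pi_k = 1/2: 2 - 1 + log(0.5)
delta_k(c(1, 1), c(1, 1), diag(2), 0.5)
```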